We provide all the information about MCP servers via our MCP API.
curl -X GET 'https://glama.ai/api/mcp/v1/servers/mpnikhil/lenny-rag-mcp'
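As an illustration, here is a minimal Python sketch of the same request. It assumes the endpoint returns a JSON body (as the attached file suggests); the response schema is not documented here, so the snippet only lists whatever top-level keys come back.

import requests  # third-party HTTP client: pip install requests

url = "https://glama.ai/api/mcp/v1/servers/mpnikhil/lenny-rag-mcp"
response = requests.get(url, timeout=10)
response.raise_for_status()  # fail loudly on non-2xx status codes

data = response.json()
# The schema is assumed rather than documented here: list the top-level keys if the
# body is a JSON object, otherwise print the raw value.
print(sorted(data) if isinstance(data, dict) else data)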
If you have feedback or need assistance with the MCP directory API, please join our Discord server.
Hamel Husain & Shreya Shankar.json
{
"episode": {
"guest": "Hamel Husain and Shreya Shankar",
"expertise_tags": [
"AI product evaluation",
"Error analysis",
"LLM evaluation frameworks",
"Product management",
"Data science",
"AI application testing",
"Evaluation methodology"
],
"summary": "Hamel Husain and Shreya Shankar, co-creators of the leading evals course on Maven, discuss the critical practice of building evaluations for AI products. They walk through a complete error analysis workflow using a real estate property management AI assistant, demonstrating how to systematically identify failure modes, categorize them, and build both code-based and LLM-as-judge evaluators. The conversation addresses common misconceptions about evals, clarifies the debate around evals versus A-B testing, and provides practical guidance on implementing evaluation systems that help teams iterate and improve their AI products quickly and efficiently.",
"key_frameworks": [
"Error analysis",
"Open coding",
"Axial coding",
"Theoretical saturation",
"Benevolent dictator model",
"LLM as judge evaluation",
"Code-based evaluation",
"Confusion matrix analysis",
"Data-driven product improvement"
]
},
"topics": [
{
"id": "topic_1",
"title": "Introduction to Evaluations and Core Concepts",
"summary": "Lenny introduces the topic of evals as an emerging critical skill for AI product builders, explaining how evals have become one of the most important topics on his podcast. The conversation establishes that building great AI products requires mastery of evals and that this is a new skill that didn't exist two years ago.",
"timestamp_start": "00:00:00",
"timestamp_end": "00:02:41",
"line_start": 1,
"line_end": 42
},
{
"id": "topic_2",
"title": "What Are Evals: Fundamental Definition and Mental Models",
"summary": "Hamel and Shreya define evals as a systematic way to measure and improve AI applications, explaining they are essentially data analytics on LLM applications. They discuss the spectrum of evaluation approaches from unit tests to broad quality metrics, rejecting the narrow framing of evals as just unit tests.",
"timestamp_start": "00:05:07",
"timestamp_end": "00:09:56",
"line_start": 59,
"line_end": 92
},
{
"id": "topic_3",
"title": "Error Analysis: Looking at Real Data and Taking Notes",
"summary": "Hamel walks through the first critical step of building evals using real traces from Nurture Boss, a property management AI assistant. He demonstrates how to examine individual interactions, write informal notes about problems (open coding), and emphasizes the importance of product thinking in identifying what constitutes an error.",
"timestamp_start": "00:10:06",
"timestamp_end": "00:30:07",
"line_start": 94,
"line_end": 310
},
{
"id": "topic_4",
"title": "Open Coding and the Benevolent Dictator Pattern",
"summary": "Shreya and Hamel explain open coding as freeform note-taking about data and introduce the benevolent dictator concept, where a single domain expert leads the categorization process to avoid expensive committee-based decision making. They explain this is crucial for making the process tractable and fast.",
"timestamp_start": "00:25:13",
"timestamp_end": "00:28:07",
"line_start": 239,
"line_end": 285
},
{
"id": "topic_5",
"title": "Theoretical Saturation and Sample Sizing",
"summary": "Shreya introduces the concept of theoretical saturation from qualitative analysis, explaining that the right number of traces to review is when you stop learning new failure modes. They recommend 100 as a mental unblocker but emphasize that 15-60 traces may be sufficient depending on the application.",
"timestamp_start": "00:30:30",
"timestamp_end": "00:31:39",
"line_start": 313,
"line_end": 336
},
{
"id": "topic_6",
"title": "From Open Codes to Axial Codes: Synthesizing Failure Modes",
"summary": "Hamel demonstrates using LLMs to synthesize open codes into axial codes (failure mode categories). He shows how to prompt an LLM to generate categories, then manually refine them to be more specific and actionable, explaining the balance between AI assistance and human judgment.",
"timestamp_start": "00:31:42",
"timestamp_end": "00:40:10",
"line_start": 340,
"line_end": 465
},
{
"id": "topic_7",
"title": "Automating Code Categorization with AI",
"summary": "Hamel shows how to use AI (specifically showing Google Sheets AI and Gemini) to automatically categorize open codes into the refined axial code categories. He emphasizes the importance of detailed open codes so that AI can accurately map them, and demonstrates using a simple spreadsheet formula approach.",
"timestamp_start": "00:40:59",
"timestamp_end": "00:44:03",
"line_start": 467,
"line_end": 407
},
{
"id": "topic_8",
"title": "Data Analysis and Pivot Tables: Identifying Priority Problems",
"summary": "Hamel demonstrates using pivot tables to count and visualize the frequency of different failure modes, allowing teams to identify which problems are most prevalent. This analysis transforms raw observation data into actionable priorities for improvement.",
"timestamp_start": "00:44:40",
"timestamp_end": "00:46:53",
"line_start": 544,
"line_end": 557
},
{
"id": "topic_9",
"title": "Code-Based Evals: Building Automated Evaluators",
"summary": "Shreya explains code-based evals as automated evaluators for failure modes that can be checked via Python functions or code logic, suitable for objective checks like output format, JSON validity, or string matching. These are cheaper and faster than LLM-based approaches.",
"timestamp_start": "00:48:46",
"timestamp_end": "00:49:56",
"line_start": 571,
"line_end": 596
},
{
"id": "topic_10",
"title": "LLM as Judge: Building Evaluators for Subjective Failure Modes",
"summary": "Hamel and Shreya introduce LLM as judge, where an LLM evaluates complex subjective failure modes. They emphasize the critical importance of binary pass/fail judgments, show a complete judge prompt example for handoff issues, and explain how to validate the judge against human judgment.",
"timestamp_start": "00:52:16",
"timestamp_end": "01:00:56",
"line_start": 625,
"line_end": 680
},
{
"id": "topic_11",
"title": "Confusion Matrix Analysis: Validating LLM Judges",
"summary": "Shreya and Hamel explain why simple agreement metrics are misleading and show confusion matrices as the proper way to validate LLM judges. They demonstrate looking at false positives and false negatives separately and iterating on judge prompts to reduce misalignment.",
"timestamp_start": "00:57:18",
"timestamp_end": "01:00:56",
"line_start": 654,
"line_end": 680
},
{
"id": "topic_12",
"title": "Evals as Product Requirements: Criteria Drift and Evolving Standards",
"summary": "The conversation explores how evals function as living product requirement documents, similar to PRDs. Shreya discusses her research on criteria drift, showing that evaluators' standards change as they review more outputs and uncover new failure modes they couldn't have predicted upfront.",
"timestamp_start": "01:01:45",
"timestamp_end": "01:05:43",
"line_start": 685,
"line_end": 740
},
{
"id": "topic_13",
"title": "Number of LLM Judge Evals and Cost-Benefit Analysis",
"summary": "Shreya explains that most products need only 4-7 LLM judge evals, not comprehensive coverage. She emphasizes prioritizing only the pesky failure modes that can't be fixed by prompt changes, understanding the cost-benefit tradeoff of building evals.",
"timestamp_start": "01:05:19",
"timestamp_end": "01:06:21",
"line_start": 727,
"line_end": 752
},
{
"id": "topic_14",
"title": "Data Analysis Sophistication and Ongoing Improvements",
"summary": "Hamel discusses how teams can get more sophisticated with data analysis beyond simple counting, using various sampling techniques and data exploration methods. He emphasizes that this resembles traditional analytics but applied to LLM applications.",
"timestamp_start": "01:06:30",
"timestamp_end": "01:08:41",
"line_start": 754,
"line_end": 763
},
{
"id": "topic_15",
"title": "Using Evals in Production: Unit Tests and Online Monitoring",
"summary": "Shreya explains how evals move beyond development and into production through unit tests and continuous online monitoring. She discusses how teams build dashboards and use these metrics as competitive advantages (moats) they don't share publicly.",
"timestamp_start": "01:07:48",
"timestamp_end": "01:08:41",
"line_start": 760,
"line_end": 765
},
{
"id": "topic_16",
"title": "Misconceptions About Evals: The Debate and Nuance",
"summary": "The conversation addresses the significant controversy around evals on social media, with Shreya explaining that much of the debate stems from narrow definitions of evals and from people who were burned by poorly implemented LLM judges. She contextualizes that many disagreements come from different understandings of what evals include.",
"timestamp_start": "01:10:19",
"timestamp_end": "01:15:40",
"line_start": 775,
"line_end": 812
},
{
"id": "topic_17",
"title": "Claude Code and the Vibes-Based Development Debate",
"summary": "The conversation discusses the apparent contradiction between Claude Code's success without explicit evals and the argument that evals are critical. Shreya argues that Claude Code is built on the foundation of extensive evals done on Claude's base models and is likely doing implicit error analysis.",
"timestamp_start": "01:12:24",
"timestamp_end": "01:14:35",
"line_start": 781,
"line_end": 803
},
{
"id": "topic_18",
"title": "Evals Versus A-B Testing: Complementary Not Competing",
"summary": "Shreya and Hamel explain that evals and A-B tests are complementary parts of a data science toolkit, not opposing approaches. They argue that A-B tests should be grounded in error analysis rather than hypothesis-driven without ground truth, and that evals represent the broader data science thinking needed for AI products.",
"timestamp_start": "01:16:38",
"timestamp_end": "01:19:50",
"line_start": 820,
"line_end": 833
},
{
"id": "topic_19",
"title": "Common Misconceptions About Evals",
"summary": "Hamel and Shreya identify the top misconceptions: (1) that AI tools can automatically do evals without human judgment, (2) that teams skip looking at actual data, and (3) that there's only one correct way to do evals. They emphasize the importance of human involvement and domain expertise.",
"timestamp_start": "01:24:31",
"timestamp_end": "01:26:28",
"line_start": 877,
"line_end": 900
},
{
"id": "topic_20",
"title": "Practical Tips for Starting and Improving Evals",
"summary": "Shreya and Hamel provide actionable advice: don't be scared of the process, use AI to help organize thinking throughout, create custom tools to remove friction from data exploration, and remember the goal is actionable product improvement, not perfect evals.",
"timestamp_start": "01:26:37",
"timestamp_end": "01:29:56",
"line_start": 904,
"line_end": 923
},
{
"id": "topic_21",
"title": "Building Custom Interfaces for Error Analysis",
"summary": "Hamel shows an example of a custom web application built for Nurture Boss to make data exploration frictionless, with features like channel filtering and visual error counts. He explains how AI makes building such tools accessible and cost-effective.",
"timestamp_start": "01:28:22",
"timestamp_end": "01:29:56",
"line_start": 920,
"line_end": 923
},
{
"id": "topic_22",
"title": "Time Investment and ROI of Evals",
"summary": "Shreya shares that initial error analysis typically takes 3-4 days of focused work, then becomes a simple 30-minute weekly maintenance task. This one-time investment yields significant ongoing returns through continuous product improvement.",
"timestamp_start": "01:30:45",
"timestamp_end": "01:31:56",
"line_start": 934,
"line_end": 944
},
{
"id": "topic_23",
"title": "The Fun and Iterative Nature of Evals Work",
"summary": "Hamel shares an anecdote about discovering generic recruiting email language through data review, emphasizing the enjoyment of putting on a product hat and critically evaluating AI outputs. He illustrates how evals work is intellectually engaging and immediately valuable.",
"timestamp_start": "01:32:06",
"timestamp_end": "01:33:38",
"line_start": 949,
"line_end": 954
},
{
"id": "topic_24",
"title": "Course Curriculum and Learning Resources",
"summary": "Shreya and Hamel describe their Maven course curriculum covering error analysis, automated evaluators, application improvement flywheel, custom interface building, and cost optimization. They highlight the 160-page book, active Discord community, and AI bot with 10 months free access for students.",
"timestamp_start": "01:33:51",
"timestamp_end": "01:37:26",
"line_start": 958,
"line_end": 1001
},
{
"id": "topic_25",
"title": "Lightning Round: Books, Media, Products, and Philosophies",
"summary": "Hamel and Shreya share personal recommendations including books (Pachinko, Apple in China, Machine Learning by Mitchell), entertainment (Frozen, The Wire), favorite tools (Claude Code, Cursor), and life philosophies (think like a beginner, understand the other side's argument).",
"timestamp_start": "01:38:04",
"timestamp_end": "01:44:06",
"line_start": 1015,
"line_end": 1122
}
],
"insights": [
{
"id": "insight_1",
"text": "To build great AI products, you need to be really good at building evals. It's the highest ROI activity you can engage in.",
"context": "Opening statement setting the critical importance of evals for AI product development",
"topic_id": "topic_1",
"line_start": 1,
"line_end": 2
},
{
"id": "insight_2",
"text": "The goal is not to do evals perfectly, it's to actionably improve your product.",
"context": "Core principle emphasizing the practical purpose of evaluations over theoretical perfection",
"topic_id": "topic_1",
"line_start": 10,
"line_end": 11
},
{
"id": "insight_3",
"text": "Evals is a way to systematically measure and improve an AI application, and at its core, it's data analytics on your LLM application.",
"context": "Definition of evals as data-driven product analytics rather than traditional testing",
"topic_id": "topic_2",
"line_start": 61,
"line_end": 62
},
{
"id": "insight_4",
"text": "Unit tests are a very small part of that very big puzzle. Evals could be data analysis to find new cohorts, metrics tracked over time, or basic user feedback metrics.",
"context": "Explaining the spectrum of evaluation approaches beyond code-based unit tests",
"topic_id": "topic_2",
"line_start": 85,
"line_end": 89
},
{
"id": "insight_5",
"text": "Before evals, you would be left with guessing. You might fix a prompt and hope you're not breaking anything else, relying on vibe checks. As applications grow, vibe checks become unmanageable.",
"context": "Problem statement explaining why systematic evals are necessary",
"topic_id": "topic_2",
"line_start": 68,
"line_end": 71
},
{
"id": "insight_6",
"text": "Don't jump straight to writing tests. You should start with data analysis to ground what you should even test. With LLMs, there's a lot more surface area and stochasticity.",
"context": "Cautioning against the common mistake of jumping directly to test writing without understanding failure modes first",
"topic_id": "topic_3",
"line_start": 94,
"line_end": 95
},
{
"id": "insight_7",
"text": "You need to put your product hat on because product people understand the user experience. A developer might not see why something is wrong if it technically works.",
"context": "Explaining why product managers or domain experts must lead error analysis, not just engineers",
"topic_id": "topic_3",
"line_start": 145,
"line_end": 149
},
{
"id": "insight_8",
"text": "When you're doing data analysis of your LLM application, write down the first thing that's wrong—the most upstream error. Don't try to find all errors. Just capture the first thing and stop.",
"context": "Guidance on efficient error analysis methodology",
"topic_id": "topic_3",
"line_start": 203,
"line_end": 203
},
{
"id": "insight_9",
"text": "An LLM looking at a trace without context will say it's fine when it's actually hallucinating or missing product expectations. You can't automate error analysis with an LLM at this stage.",
"context": "Demonstrating the limitation of AI in error analysis without human domain expertise",
"topic_id": "topic_3",
"line_start": 220,
"line_end": 230
},
{
"id": "insight_10",
"text": "When doing open coding, a lot of teams get bogged down having a committee. You can appoint one person whose taste you trust—a benevolent dictator. You don't want to make this process so expensive that you can't do it.",
"context": "Introducing the benevolent dictator pattern to reduce decision-making overhead",
"topic_id": "topic_4",
"line_start": 259,
"line_end": 260
},
{
"id": "insight_11",
"text": "The benevolent dictator should be the person with domain expertise. For legal matters, a lawyer. For mental health, a mental health expert. Oftentimes, it's the product manager.",
"context": "Guidance on who should lead the evaluation process",
"topic_id": "topic_4",
"line_start": 271,
"line_end": 278
},
{
"id": "insight_12",
"text": "Keep looking at traces until you feel like you're not learning anything new. This is called theoretical saturation—when you're not uncovering any new failure modes or concepts.",
"context": "Explaining the natural stopping point for error analysis based on learning curve",
"topic_id": "topic_5",
"line_start": 311,
"line_end": 320
},
{
"id": "insight_13",
"text": "After doing 20 traces, you will automatically find it so useful that you will continue doing it. The intuition for when to stop develops over 2-3 rounds.",
"context": "Practical guidance that the process becomes addictive and intuitive",
"topic_id": "topic_5",
"line_start": 308,
"line_end": 323
},
{
"id": "insight_14",
"text": "When using LLMs to create axial codes, you can be very detailed in your prompt about what you want. Tell it you want actionable failure modes or to group by user story stage. There's no definitive way to do it.",
"context": "Emphasizing the flexibility and iterative nature of using AI in the eval process",
"topic_id": "topic_6",
"line_start": 394,
"line_end": 407
},
{
"id": "insight_15",
"text": "Error analysis is grounded in social science concepts like open coding and axial coding that have been around for a long time. We didn't invent this; we adapted it for LLMs.",
"context": "Clarifying that evals methodology builds on established research traditions",
"topic_id": "topic_6",
"line_start": 425,
"line_end": 428
},
{
"id": "insight_16",
"text": "Your open codes have to be detailed. You can't just say 'janky.' If an AI is reading it, it won't be able to categorize properly. Even a human would have to remember why you said janky.",
"context": "Practical lesson on the importance of detailed documentation during error analysis",
"topic_id": "topic_7",
"line_start": 485,
"line_end": 485
},
{
"id": "insight_17",
"text": "Basic counting is the most powerful analytical technique in data science because it's simple and it's undervalued. It takes chaotic observations and gives you actionable priorities.",
"context": "Highlighting the power of simple quantitative analysis over complex methods",
"topic_id": "topic_8",
"line_start": 353,
"line_end": 353
},
{
"id": "insight_18",
"text": "Not all failure modes need evals. Some are just dumb engineering errors you can fix directly. You need to make a cost-benefit tradeoff and don't want to get carried away with evals.",
"context": "Practical guidance on deciding which failure modes warrant evaluation infrastructure",
"topic_id": "topic_8",
"line_start": 551,
"line_end": 557
},
{
"id": "insight_19",
"text": "Code-based evals are cheaper and faster to build when possible. You should try to do code-based evals if you can before turning to LLM judges.",
"context": "Cost-optimization principle for eval infrastructure",
"topic_id": "topic_9",
"line_start": 562,
"line_end": 564
},
{
"id": "insight_20",
"text": "An LLM judge should output binary pass/fail, not a rating scale. A score of 3.2 versus 3.7 is meaningless and prevents decision-making. Force yourself to make a decision.",
"context": "Critical principle for LLM judge design",
"topic_id": "topic_10",
"line_start": 625,
"line_end": 629
},
{
"id": "insight_21",
"text": "If someone reports LLM judge agreement percentage without a confusion matrix, that's a red flag. Rare errors can give high agreement by chance. You must examine false positives and false negatives separately.",
"context": "Warning against misleading evaluation metrics",
"topic_id": "topic_11",
"line_start": 662,
"line_end": 675
},
{
"id": "insight_22",
"text": "Your rubrics can't be figured out upfront. People's opinions of good and bad change as they review more outputs. You can't dream up all failure modes at the beginning.",
"context": "Fundamental insight about the emergent nature of evaluation criteria for AI systems",
"topic_id": "topic_12",
"line_start": 716,
"line_end": 716
},
{
"id": "insight_23",
"text": "Evals function exactly like PRDs—they specify what the product should do. What's different is that evals are derived from actual user data and failures, making them grounded in reality.",
"context": "Connecting evals to traditional product management practices",
"topic_id": "topic_12",
"line_start": 686,
"line_end": 692
},
{
"id": "insight_24",
"text": "Most products need 4-7 LLM judge evals, not comprehensive coverage. Many failure modes can be fixed by improving the prompt. Don't write evals for everything.",
"context": "Practical guidance on the typical scope of evaluation infrastructure",
"topic_id": "topic_13",
"line_start": 728,
"line_end": 734
},
{
"id": "insight_25",
"text": "People often do A-B tests prematurely without first doing error analysis. They hypothesize what's wrong without grounding in data. Ground your hypotheses in error analysis first.",
"context": "Explaining the proper sequence of analysis before experimentation",
"topic_id": "topic_18",
"line_start": 826,
"line_end": 827
},
{
"id": "insight_26",
"text": "The debate between evals and A-B testing is really about data science thinking. You need multiple tools in your toolkit. Evals and A-B tests are complementary, not competing.",
"context": "Reframing the apparent conflict in AI product development practices",
"topic_id": "topic_18",
"line_start": 832,
"line_end": 833
},
{
"id": "insight_27",
"text": "When Claude Code engineers say they don't do evals, they're standing on the shoulders of evals done on Claude's base models. They're also implicitly doing error analysis through dogfooding.",
"context": "Contextualizing the apparent contradiction between successful AI products and explicit eval processes",
"topic_id": "topic_17",
"line_start": 785,
"line_end": 794
},
{
"id": "insight_28",
"text": "Coding agents are fundamentally different than other AI products because the developer is the domain expert and user. You can collapse activities and don't need as much data and feedback.",
"context": "Explaining why evaluation approaches vary by product type",
"topic_id": "topic_17",
"line_start": 796,
"line_end": 797
},
{
"id": "insight_29",
"text": "The biggest misconception is thinking you can buy a tool that does evals for you. It doesn't work. LLMs can't do error analysis without human judgment and domain expertise.",
"context": "Top misconception about evals automation",
"topic_id": "topic_19",
"line_start": 877,
"line_end": 878
},
{
"id": "insight_30",
"text": "The second major misconception is people not looking at their actual data. The most powerful activity is going to look at individual traces and understand what's happening.",
"context": "Emphasizing the importance of direct data examination",
"topic_id": "topic_19",
"line_start": 883,
"line_end": 884
},
{
"id": "insight_31",
"text": "There's no one correct way to do evals, but there are many incorrect ways. You have to think about where you are with your product and resources, and figure out the plan that works for you.",
"context": "Acknowledging the flexible but not arbitrary nature of eval implementation",
"topic_id": "topic_19",
"line_start": 898,
"line_end": 899
},
{
"id": "insight_32",
"text": "Don't be scared of looking at your data. The goal is not to do evals perfectly, it's to actionably improve your product. You're guaranteed to find improvements if you do any part of this process.",
"context": "Encouraging mindset for teams starting their eval journey",
"topic_id": "topic_20",
"line_start": 904,
"line_end": 905
},
{
"id": "insight_33",
"text": "Use LLMs to help you throughout the entire eval process—organizing thoughts, improving PRDs based on open codes, presenting information better. But don't use them to replace yourself.",
"context": "Balanced perspective on AI assistance in evaluation work",
"topic_id": "topic_20",
"line_start": 908,
"line_end": 914
},
{
"id": "insight_34",
"text": "Build your own tools to make it as easy as possible to look at data. With AI, you can create simple web applications to remove all friction from data exploration in a few hours.",
"context": "Practical tip for making evaluation infrastructure accessible",
"topic_id": "topic_21",
"line_start": 920,
"line_end": 923
},
{
"id": "insight_35",
"text": "Initial error analysis takes 3-4 days of focused work, but after that, it becomes 30 minutes per week. This is a one-time cost that yields ongoing returns.",
"context": "Time investment analysis showing favorable ROI of evals",
"topic_id": "topic_22",
"line_start": 934,
"line_end": 938
},
{
"id": "insight_36",
"text": "This process is a lot of fun. Everyone that does error analysis immediately gets addicted to it. You learn so much by putting your product hat on and looking at real interactions.",
"context": "Positive reinforcement about the engaging nature of eval work",
"topic_id": "topic_23",
"line_start": 5,
"line_end": 5
}
],
"examples": [
{
"id": "example_1",
"explicit_text": "At Anthropic and OpenAI, both CPOs shared that evals are becoming the most important new skill for product builders.",
"inferred_identity": "Dario Amodei (Anthropic) / Sam Altman (OpenAI)",
"confidence": 0.85,
"tags": [
"Anthropic",
"OpenAI",
"CPO",
"chief product officer",
"evals",
"AI product",
"skill development",
"industry leaders"
],
"lesson": "Evals have emerged as a critical competency that is recognized and prioritized at the highest levels of leading AI companies.",
"topic_id": "topic_1",
"line_start": 32,
"line_end": 32
},
{
"id": "example_2",
"explicit_text": "Real estate assistant application from Nurture Boss, a company that AI assists property managers with inbound leads, customer service, booking appointments, and operations.",
"inferred_identity": "Nurture Boss",
"confidence": 1.0,
"tags": [
"Nurture Boss",
"property management",
"real estate",
"AI assistant",
"customer service",
"operations",
"tool calling",
"RAG",
"voice chat",
"text messaging"
],
"lesson": "Complex AI applications with multiple channels, tool integrations, and external data sources require systematic error analysis to identify failures across many dimensions.",
"topic_id": "topic_3",
"line_start": 98,
"line_end": 101
},
{
"id": "example_3",
"explicit_text": "User asked about one-bedroom with study availability. The AI said we have one-bedroom apartments but none with study, then when asked when one with study would be available, said it doesn't have specific information.",
"inferred_identity": "Nurture Boss property management example",
"confidence": 1.0,
"tags": [
"Nurture Boss",
"hallucination",
"incomplete response",
"customer service failure",
"real estate",
"availability query",
"product mistake"
],
"lesson": "AI systems can technically process information correctly but fail to provide the helpful next step that a human would offer, requiring a handoff to human agents.",
"topic_id": "topic_3",
"line_start": 128,
"line_end": 137
},
{
"id": "example_4",
"explicit_text": "Text message interaction where customer wrote 'Okay, I've been texting you all day. Please.' The system received garbled input because text messages are short phrases split across multiple turns.",
"inferred_identity": "Nurture Boss text messaging feature",
"confidence": 1.0,
"tags": [
"Nurture Boss",
"text messaging",
"channel issue",
"technical error",
"conversation fragmentation",
"multi-turn handling"
],
"lesson": "Different communication channels (text vs. voice) require different handling logic. Text messaging's characteristic pattern of short, fragmented messages needs special consideration.",
"topic_id": "topic_3",
"line_start": 169,
"line_end": 173
},
{
"id": "example_5",
"explicit_text": "User asked 'Do you have a one-bedroom with study available?' and 'Do you provide virtual tours?' The system responded 'We do offer virtual tours. You can schedule a tour,' but there was no virtual tour functionality.",
"inferred_identity": "Nurture Boss hallucination example",
"confidence": 1.0,
"tags": [
"Nurture Boss",
"hallucination",
"false capability",
"virtual tours",
"real estate",
"dangerous error",
"feature claiming"
],
"lesson": "Hallucinations about product capabilities are particularly dangerous in customer-facing applications because they create false expectations and erode trust.",
"topic_id": "topic_3",
"line_start": 206,
"line_end": 209
},
{
"id": "example_6",
"explicit_text": "User asked about specials, received 5% military discount, then asked about floor availability and one-bedroom availability. The AI called transfer call without confirming with the user.",
"inferred_identity": "Nurture Boss call transfer failure",
"confidence": 1.0,
"tags": [
"Nurture Boss",
"handoff issue",
"abrupt transfer",
"poor UX",
"customer service",
"process violation"
],
"lesson": "Transferring to humans should be explicit and customer-initiated when possible, not abrupt and unannounced, to maintain customer experience quality.",
"topic_id": "topic_3",
"line_start": 289,
"line_end": 293
},
{
"id": "example_7",
"explicit_text": "Claude was used to analyze a CSV of open codes and create axial codes, automatically categorizing failure modes into groups like 'capability limitations', 'misrepresentation', 'process violations', 'handoff issues', 'communication quality'.",
"inferred_identity": "Claude API / Claude model",
"confidence": 0.95,
"tags": [
"Claude",
"LLM",
"data analysis",
"categorization",
"axial coding",
"text analysis",
"automation",
"Anthropic"
],
"lesson": "LLMs are highly effective at synthesizing open codes into organized categories, but the results still require human review and refinement for actionability.",
"topic_id": "topic_6",
"line_start": 356,
"line_end": 383
},
{
"id": "example_8",
"explicit_text": "Andrew Ng, a famous machine learning researcher, is shown in an 8-year-old video discussing error analysis as a technique for analyzing stochastic systems.",
"inferred_identity": "Andrew Ng",
"confidence": 1.0,
"tags": [
"Andrew Ng",
"Stanford",
"machine learning",
"error analysis",
"research",
"stochastic systems",
"foundational technique"
],
"lesson": "Error analysis is a well-established research technique from machine learning that has direct applicability to modern LLM systems.",
"topic_id": "topic_6",
"line_start": 440,
"line_end": 443
},
{
"id": "example_9",
"explicit_text": "Using Google Sheets AI to automatically categorize open codes into axial code buckets like 'tour scheduling', 'human handoff issues', 'formatting errors', 'conversational flow'.",
"inferred_identity": "Google Sheets AI functionality",
"confidence": 0.95,
"tags": [
"Google Sheets",
"Google AI",
"spreadsheet automation",
"categorization",
"Gemini",
"accessible tools"
],
"lesson": "Spreadsheet-based AI assistance makes advanced data analysis accessible to product teams without requiring specialized technical infrastructure.",
"topic_id": "topic_7",
"line_start": 472,
"line_end": 482
},
{
"id": "example_10",
"explicit_text": "Pivot table showing: 17 conversational flow issues, multiple human handoff issues, formatting errors with output, making follow-up promises not kept.",
"inferred_identity": "Nurture Boss error frequency analysis",
"confidence": 1.0,
"tags": [
"Nurture Boss",
"data analysis",
"pivot table",
"error quantification",
"prioritization",
"failure modes"
],
"lesson": "Visualizing error frequency through simple counting transforms qualitative observations into actionable priorities for product improvement.",
"topic_id": "topic_8",
"line_start": 545,
"line_end": 551
},
{
"id": "example_11",
"explicit_text": "LLM judge prompt for evaluating whether an AI should hand off to a human, checking against specific criteria like explicit human requests, policy-mandated transfers, sensitive issues, tool data unavailability.",
"inferred_identity": "Nurture Boss handoff eval",
"confidence": 1.0,
"tags": [
"Nurture Boss",
"LLM judge",
"handoff evaluation",
"binary decision",
"property management",
"customer service"
],
"lesson": "Well-crafted LLM judge prompts make subjective quality judgments actionable by defining specific, clear criteria for pass/fail decisions.",
"topic_id": "topic_10",
"line_start": 647,
"line_end": 650
},
{
"id": "example_12",
"explicit_text": "Confusion matrix analysis showing human judgment vs. LLM judge judgment, with focus on false positives and false negatives rather than overall agreement percentage.",
"inferred_identity": "Nurture Boss eval validation",
"confidence": 1.0,
"tags": [
"Nurture Boss",
"LLM judge",
"validation",
"confusion matrix",
"error types",
"evaluation quality"
],
"lesson": "Confusion matrices provide much more actionable feedback than overall agreement percentages, revealing which types of errors the judge is making.",
"topic_id": "topic_11",
"line_start": 670,
"line_end": 675
},
{
"id": "example_13",
"explicit_text": "Shreya's research paper titled 'Who Validates the Validated?' showed that people's evaluation criteria drift and change as they review more outputs, especially with LLM systems.",
"inferred_identity": "Shreya Shankar research",
"confidence": 1.0,
"tags": [
"Shreya Shankar",
"research",
"academia",
"criteria drift",
"evaluation methodology",
"user study",
"LLM evaluation"
],
"lesson": "Evaluators naturally refine their standards and discover new failure modes through exposure to real outputs, making upfront rubrics necessarily incomplete.",
"topic_id": "topic_12",
"line_start": 698,
"line_end": 716
},
{
"id": "example_14",
"explicit_text": "Claude Code is reportedly built on vibes and doesn't explicitly do evals, yet it's highly successful, sparking debate about whether evals are necessary.",
"inferred_identity": "Claude Code / Anthropic",
"confidence": 0.85,
"tags": [
"Claude Code",
"Anthropic",
"coding agent",
"vibes-based development",
"implicit evals",
"debate"
],
"lesson": "Successful AI products with expert domain user-developers may succeed through implicit evals and intensive dogfooding, but this doesn't generalize to all product types.",
"topic_id": "topic_17",
"line_start": 781,
"line_end": 794
},
{
"id": "example_15",
"explicit_text": "OpenAI has gone to lengths to analyze Twitter sentiment and Reddit threads complaining about their products and tied that back to improvements, demonstrating the breadth of evals thinking.",
"inferred_identity": "OpenAI",
"confidence": 0.9,
"tags": [
"OpenAI",
"GPT",
"public sentiment",
"social listening",
"product feedback",
"data analysis"
],
"lesson": "Evals thinking extends beyond automated metrics to include synthesis of public feedback and sentiment to inform product direction.",
"topic_id": "topic_18",
"line_start": 845,
"line_end": 848
},
{
"id": "example_16",
"explicit_text": "A recruiting email being sent with generic opening 'Given your background, blah blah blah.' When examined critically, it's an obvious generic recruiting email that would be deleted by recipients.",
"inferred_identity": "Email generation AI (Hamel's consulting client)",
"confidence": 0.7,
"tags": [
"email assistant",
"recruiting",
"outreach",
"generic messaging",
"product failure",
"UX review"
],
"lesson": "Putting on a product hat and critically examining individual outputs reveals failures that automated metrics might miss, like tone-deaf or generic messaging.",
"topic_id": "topic_23",
"line_start": 950,
"line_end": 953
},
{
"id": "example_17",
"explicit_text": "Hamel and Shreya teach a course on Maven that is the number one highest-grossing course on the platform, with 2,000+ PMs and engineers trained across 500 companies.",
"inferred_identity": "Maven course platform",
"confidence": 1.0,
"tags": [
"Maven",
"online education",
"evals course",
"popular course",
"product management",
"engineering education"
],
"lesson": "There is significant demand for structured education on evals, indicating this is a critical emerging skill that organizations are actively trying to develop.",
"topic_id": "topic_24",
"line_start": 35,
"line_end": 35
},
{
"id": "example_18",
"explicit_text": "The course includes 160-page book, custom AI bot with 10 months free access, live coding interfaces for error analysis, cost optimization techniques, and active Discord community.",
"inferred_identity": "Hamel and Shreya's Maven course",
"confidence": 1.0,
"tags": [
"Maven course",
"educational materials",
"AI bot",
"community",
"practical training",
"comprehensive curriculum"
],
"lesson": "Effective AI education now integrates multiple learning modalities including documentation, AI assistants, live coding, and community support.",
"topic_id": "topic_24",
"line_start": 964,
"line_end": 980
}
]
}